
gbn: Sender side only packeting resend bug fix #87

Conversation


@ViktorTigerstrom ViktorTigerstrom commented Nov 7, 2023

Fixes #89

This PR addresses the packet resend bug through a sender-side-only fix.

This PR ensures that after we have resent the queue, we wait until we know that both parties are in sync before we continue to send new packets. This ensures that we don't end up in an indefinite resend loop due to latency and delayed NACKs from the other party, which could happen prior to this commit.

To understand why we need to wait for the expected ACK/NACK after resending the queue, consider the following scenario:

  1. Alice sends packets 1, 2, 3 & 4 to Bob.
  2. Bob receives packets 1, 2, 3 & 4, and sends back the respective ACKs.
  3. Alice receives the ACKs for packets 1 & 2, but due to latency the ACKs for packets 3 & 4 are delayed and aren't received until Alice's resend timeout has passed, which leads to Alice resending packets 3 & 4. Alice will then receive the delayed ACKs for packets 3 & 4, but will treat them as the ACKs for the resent packets, not the original packets they were actually sent for. If we didn't wait after resending the queue, Alice would then proceed to send more packets (5 & 6).
  4. When Bob receives the resent packets 3 & 4, Bob will respond with NACK 5. Due to latency, the packets 5 & 6 that Alice sent in step (3) above will then be received by Bob, and be processed as the correct response to the NACK 5. Bob will then await packet 7.
  5. Alice will receive the NACK 5 and resend packets 5 & 6. But as Bob is now awaiting packet 7, this send will lead to a NACK 7. And due to latency, if Alice doesn't wait after resending the queue, Alice will proceed to send new packet(s) before receiving the NACK 7.
  6. This resend loop would continue indefinitely, so we need to ensure that Alice waits after she has resent the queue until she is sure that both parties are in sync, before proceeding to send new packets.

To ensure that we are in sync after we have resent the queue, we wait until we EITHER:

  1. Receive a NACK for the sequence number succeeding the last packet in the resent queue, i.e. NACK 5 in step (3) above.
  2. Receive an ACK for the last packet in the resent queue, i.e. ACK 4 in step (3) above. After we receive the expected ACK, we then wait for the duration of the resend timeout before continuing. The reason we wait is that the ACKs we receive after a resend could be delayed ACKs for the original packets we sent, and not ACKs for the resent packets. In step (3) above, the ACKs for packets 3 & 4 that Alice received were delayed ACKs for the original packets. If Alice had immediately continued to send new packets (5 & 6) after receiving ACK 4, she would then have received the NACK 5 from Bob, which was the actual response to the resent queue. But as Alice had already sent packets 5 & 6 when she received the NACK 5, the resend triggered by that NACK would cause the resend loop to continue indefinitely.

When either of the two conditions above is met, we consider both parties to be in sync, and we can proceed to send new packets.

TODO:

  • Update the PR with extra docs + update commit messages with more detailed docs


@ViktorTigerstrom ViktorTigerstrom left a comment


Updated with docs, and made some small changes to the PR! This is now ready for review 🔥.

I'll address CI failures tomorrow, but the PR should be reviewable even before that happens.

I've confirmed locally through manual testing that this PR causes no issues if the two parties have different resend timeouts. But something I'd still like to test is the interoperability if the client has updated to the new LNC version but the server hasn't. It would be good to ensure that this doesn't cause more resends by the server than prior to this PR.


@ellemouton ellemouton left a comment


Nice investigation!!! 🎉

Left a bunch of comments - basically, I think we can simplify and modularise the logic quite a bit by letting the queue handle everything.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch 9 times, most recently from 8f8ecc9 to 09b83a0 Compare November 13, 2023 16:04

@ViktorTigerstrom ViktorTigerstrom left a comment


Thanks a lot for the review @ellemouton!!

I've addressed your feedback, and I totally agree that it makes much more sense to let the queue struct handle the resend catch up logic instead, so I've moved it there!

Please also check my response here, and let me know what you'd like me to do with the catchUpID: #87 (comment)

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 09b83a0 to dd6046e Compare November 13, 2023 17:54
@ViktorTigerstrom

The itest that failed in this round of the CI workflow is a pre-existing flake from before this PR. It works sometimes and fails sometimes.

@ellemouton

The itest that failed in this round of the CI workflow is a pre-existing flake from before this PR. It works sometimes and fails sometimes.

You are able to trigger a manual re-run right?

Also - can you make an issue for this flake so that some investigation can be done?

@ViktorTigerstrom

I've learned how to trigger a re-run of the workflow now :)! Also opened a new issue for the error. Thanks for the nudge :)!


@ellemouton ellemouton left a comment


check out this patch for how we can do a bit of cleanup: gbn_cleanup.patch

Left a few comments - I'm mainly concerned with adding complexity to the system to solve this issue. I think we should maybe first focus on reducing the complexity and then tackle the issue at hand.

So I'm thinking we should really try to divide up what exactly is being handled by GBN and what is being handled by the queue. I think we can make use of a state-machine model too. Currently it seems like the complexity comes from the fact that both the queue and the GBN can trigger timeouts & resends. I think this should only be handled by one entity. So for example, if we are in the "resend & sync" state, then nothing can trigger a new "resend & sync" state, since we are already in that state.

Let me know what you think.

gbn/gbn_conn.go Outdated
@@ -578,7 +591,10 @@ func (g *GoBackNConn) receivePacketsForever() error { // nolint:gocyclo
}

	case *PacketACK:
-		gotValidACK := g.sendQueue.processACK(m.Seq)
+		gotValidACK := g.sendQueue.processACK(
+			m.Seq, g.resendTimeout,
+		)
Member


If we never change the value of g.resendTimeout, then why do we need to pass it to sendQueue each time we call processACK? It can instead be passed to it on instantiation and be a member of queue, right?


@ViktorTigerstrom ViktorTigerstrom Nov 17, 2023


So the problem with just setting the resendTimeout through the constructor is that the queue's resendTimeout then currently won't get set to the value set by the WithTimeout option (though that's easily fixed, of course).

Though I've fixed that issue in #88 now, with the addition of the TimeoutManager. If you agree, we'll let this PR still pass the timeout in the different function calls to the queue, and then remove that in #88 instead?

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch 7 times, most recently from f449c3a to a351b99 Compare November 17, 2023 18:58
// As we can't be sure that waitForSync cancel listener was
// triggered before this one, we send over the cancel channel
// again, to make sure that both listeners are triggered.
c.reset()
Contributor Author


We could potentially cover this instead by closing the cancel channel when executing the reset function, and creating a new channel on every sync init (and also using a sync.Once to ensure we only close the channel once per sync). But I figured this version is probably a bit cleaner.

Member


sync.Once to ensure we only cancel the channel once per sync

But then you would need to re-init the sync.Once each time too, which kinda defeats the purpose, right?

I think the state machine approach is a bit nicer than closing & recreating channels.

If you want to make sure that the reset only does something if we actually need it to, then, in the reset method, you could check that we are in the "Resending" state; if we are not, it becomes a no-op. That way, the state machine becomes a bit safer. We could add the same type of check anywhere else where we change the state.

Contributor Author


But then you would need to re-init the sync.Once each time too which kinda defeats the purpose right?

Yes exactly, which is why I opted for this solution :)!

If you want to make sure that the reset only does something if we actually need it to, then, in the reset method, you could check that we are in the "Resending" state; if we are not, it becomes a no-op. That way, the state machine becomes a bit safer. We could add the same type of check anywhere else where we change the state.

Hmmm, I'm not sure how this would remove the need to call c.reset() both here and at the end of the waitForSync function, as we can't be sure which of the select cases - here or in waitForSync - will catch the <-c.cancel case first. We need to ensure that both trigger when the c.cancel channel gets sent over. Is this a case where you'd use a channel with a size of 2, or is there a better way to solve this?

Member


It sounds to me like this could instead be handled by a sync.Cond. But IMO that is harder to understand (and also not too easy to use in the first place). So I'll leave it to you whether you want to refactor or leave it the way it is now.

@ellemouton

ellemouton commented Dec 13, 2023

I've now added it to this PR, after adjusting it slightly to cover some edge cases 🚀!

Awesome!

Let me know if you'd like me to remove the initial resending loop protection commits, or if you think it makes sense to keep this refactor as a final separate commit.

Yeah - ideal PR structure would be: 1) add the new syncer + test, then 2) inject it & use it in queue. Otherwise we sorta have to review the logic twice. (Don't worry about keeping me as commit author for the syncer refactor - it was just putting your code in another file anyways 😊)


@ellemouton ellemouton left a comment


Leaving some comments :)
I think your idea of first adding the syncer & then injecting the usage is a good one and will make reviewing easier since we then only need to review the logic once vs 2 versions of similar logic.

One thing that concerns me is that it seems (from looking at the results of some of the latest GH action runs) that with this PR, the runtime of the reconnection tests has strictly increased and so is resulting in more timeouts.

before PR

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (14.81s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.78s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (16.00s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (11.80s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (19.80s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (21.29s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (19.86s)


    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (13.81s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (10.95s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (14.66s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (15.82s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (19.80s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (22.18s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (18.52s)

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (13.84s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.71s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (15.70s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (21.82s)
    --- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (33.75s)


with PR

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (9.82s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (28.67s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (24.85s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (16.08s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (27.85s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (41.81s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (23.67s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:true) (55.31s)

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (29.89s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (29.85s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (12.09s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:true) (29.47s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (38.83s)
    --- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (35.50s)

    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:false) (14.83s)
    --- PASS: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:false,grpcClientConn:true) (33.87s)
    --- FAIL: TestLightningNodeConnect/test_client_reconnect(stagingMailbox:true,grpcClientConn:false) (33.06s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:false) (31.83s)
    --- PASS: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:false,grpcClientConn:true) (39.80s)
    --- FAIL: TestLightningNodeConnect/test_server_reconnect(stagingMailbox:true,grpcClientConn:false) (32.93s)

Is this perhaps a necessary/expected increase due to the fact that we now back-off a bit?

gbn/queue.go Outdated
@@ -101,8 +118,6 @@ func (q *queue) resend(cb func(packet *PacketData) error) error {
return nil
}

q.lastResend = time.Now()
Member


where are we setting it now though?

Contributor Author


Ah good catch, my bad!

gbn/queue.go Outdated
@@ -112,36 +127,48 @@ func (q *queue) resend(cb func(packet *PacketData) error) error {
q.topMtx.RUnlock()

if base == top {
q.syncer.reset()
Member


why do we need a reset here?

Initially in my diff I had it here cause there was an init state above this. But if that is removed, then this can be removed too, right?

Member


btw - the reason I had the other "init" state here was to remove the need for locking the awaitMu mutex before doing the seqBase and seqTop locks.

If we still want there to be no other "resend" call during the time this is going, then we should add that back.

But I think this is synchronous for now, right?

Contributor Author


why do we need a reset here

Ah my bad, missed removing that. Thanks!

btw - the reason i had the other "init" state here was to remove the need for locking the awaitMu mutex before doing the seqBase and seqTop locks.

Yeah, it's synchronous currently, so we will never call the queue.resend function twice at the same time while it's resending.
Since we now process the awaited ACK + NACK while we are also resending packets, we need to ensure that we set syncStateResending under the same lock that we use to set the awaitedACK + awaitedNACK, which is why I changed this. But like you say, if we add an extra init state to the syncer, we could ensure that two concurrent calls of the queue.resend function aren't allowed. I opted not to add it now though, to avoid extra confusion from unused functionality, but I can change it if you'd prefer :).

gbn/queue.go Outdated
Comment on lines 429 to 247
	if seq == q.sequenceTop {
		q.sequenceBase = q.sequenceTop
	}

	// If we receive the awaited NACK, we shouldn't trigger a new
	// resend, as we can now proceed to send new packets.
	return false, false
}

// If the NACK is the same as sequenceTop, and we weren't awaiting this
// NACK as part of the resend catch up, it probably means that queue
// was sent successfully, but we just missed the necessary ACKs. So we
-// can empty the queue here by bumping the base and we dont need to
+// can empty the queue here by bumping the base and we don't need to
// trigger a resend.
if seq == q.sequenceTop {
	q.sequenceBase = q.sequenceTop
-	return true, false
+
+	return false, false
}

// Is the NACKed sequence even in our queue?
if !containsSequence(q.sequenceBase, q.sequenceTop, seq) {
q.cfg.log.Tracef("NACK seq %d is not in the queue. Ignoring.",
Member


it seems the "bumped" logic is changed here only to be changed again in a later commit...

If there is a bug in the original logic - perhaps better to fix that in its own commit before any additions, cause these 2 diffs are a bit confusing to follow.

Contributor Author


Yeah good idea, cleaned this up with the latest push :)!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from a1ee556 to b101a16 Compare December 16, 2023 02:41
@ViktorTigerstrom

Thanks a lot for the review @ellemouton! I've updated the commit history quite extensively with the latest push, and addressed your feedback, so I hope this version looks better :)! 🚀

Is this perhaps a necessary/expected increase due to the fact that we now back-off a bit?

Yeah, this was part of why I added the IsPingPacket logic earlier in the PR that we later remove, as these reconnect tests now often trigger a resend sync with a ping packet, which will time out (i.e. await resendTimeout*3).
So it is expected, but a bit unfortunate. I'll experiment a bit with different timeouts later when addressing #89 - or are the timeouts something you'd like me to update in this PR :)?


@guggero guggero left a comment


Code looks good to me, great work!
But it seems like the server or client disconnect integration tests fail pretty consistently. Do you think that's something that needs to be addressed with different timeouts (e.g. #88) or should they pass reliably for this PR too?

I'm happy to merge this as-is if one of the PRs down the road shows them passing reliably.


@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from b101a16 to 82af2e2 Compare January 3, 2024 00:58
@lightninglabs-deploy

@ellemouton: review reminder

@ellemouton

I think the tests are timing out since we now don't just keep resending, right? The test that times out is the reconnection test - meaning that packets are getting dropped. So before, we would resend more frequently & now we back off quite a bit.

So I think if we are confident that this PR is an improvement in reliability, then we should just increase the itest timeout to whatever the average time required by that test is. Looks like 1 min should be fine, right?

In order to ensure reliability, I suggest running a simulation overnight that keeps sending packets back and forth and occasionally restarts the aperture server - the test is successful if by the end of the test (after 12 hours or w/e) the client and server are still in sync & no resend loops are occurring.

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 82af2e2 to c95b563 Compare January 4, 2024 02:35
@ViktorTigerstrom

Ok, so I've finally figured out what's going on in the itests! There were 2 core problems: one caused by a bug in this PR, and one pre-existing:

Issue 1:

The bug caused by this PR mainly affected the testClientReconnect test. It was caused by me missing that calling g.resendTicker.Reset(g.cfg.resendTimeout) in the resendQueue func in GoBackNConn.sendPacketsForever doesn't also drain the ticker's channel if a tick occurred before resetting the timer.
That caused the resendQueue func to be called over and over again whenever the queue syncer timed out while the client was down and never sending any ACKs/NACKs back, as the ticker always ticks when the syncer times out.

I've addressed this with the latest push!

Issue 2:

The problem is pre-existing, and is not made worse by this PR. It affects the testServerReconnect test, and is what's happening in #89.

So the core issue is that when we stop the server here, the server sends a FIN packet, and then closes the connection:

We then verify that the client's connection status has changed to mailbox.ClientStatusSessionNotFound on the line below.

But as there's a sliiiight delay between when the FIN packet is sent and when the connection is closed, 25-50% of the time when running the test, the client receives the FIN packet from the server and calls recvFromStream in GoBackNConn.receivePacketsForever before the connection has been closed, and the client will then never move the connection status to mailbox.ClientStatusSessionNotFound.

When the client then processes the FIN packet, the client will close down and never change the connection status.

Therefore, the predicate in the testServerReconnect test that verifies that the connection status changes to mailbox.ClientStatusSessionNotFound within the timeout will never be satisfied, no matter how long the timeout is set to.

For the cases when the test actually succeeds, the connection is already closed when the recvFromStream call is made, and the status is changed as per the implementation here (as c.transport.Recv(ctx) will error):

c.setStatus(errStatus)

I'm not 100% sure how we should handle this correctly, so I'd appreciate any feedback on how you think this should best be handled! One potential solution that I experimented with would be to trigger the mailbox client_conn to re-check the connection status once the client's gbn_conn processes a FIN packet in the receive function. That will cause the client's connection status to change to ClientStatusNotConnected after the connection has been closed, so given that we change the test verification predicate to also accept that status, the test will pass in this case. Does that sound like a reasonable solution, or do you have a better suggestion?


Finally, it's expected that the itests take a bit longer for the reconnection tests, as we now don't resend as frequently and back off quite a bit, as @ellemouton points out above. However, I actually don't think we need to increase any timeouts; it's mainly the 2 issues I've detailed above that have been causing problems for the itests 🚀!

In order to ensure reliability, I suggest running a simulation overnight that keeps sending packets back and forth and occasionally restarts the aperture server - the test is successful if by the end of the test (after 12 hours or w/e) the client and server are still in sync & no resend loops are occurring.

This is a great idea, I'll implement that simulation asap and let it run for a while as suggested 🔥!

@guggero guggero self-requested a review January 4, 2024 08:19
@ellemouton

Really nice investigation @ViktorTigerstrom 🔥 🔥

regarding how to solve for issue 2, could an onFIN callback work? see:
onFin.patch


@ViktorTigerstrom ViktorTigerstrom left a comment


Thanks a lot for the patch @ellemouton! That worked really well 🔥🚀!
I've added it on top of this PR, so itests are now finally consistently passing 🎉!

gbn.WithKeepalivePing(
gbnClientPingTimeout, gbnPongTimeout,
),
gbn.WithOnFIN(func() {
Contributor Author


We could potentially add this to the server_conn.go version as well, and force the server to set the connection status to ServerStatusIdle too. But I haven't really seen any case where this has caused issues on the server side yet, so I opted not to add it. I will add it to the server side as well if you think I should!


@guggero guggero left a comment


Awesome investigation on the itest flake hunt! Very nice to have things pass so consistently now.

Great work, LGTM 🎉


@ellemouton ellemouton left a comment


tACK, LGTM! Great investigations & great work!! 💥 🚀

The majority of my comments are spelling/grammar nits.

Review threads (all resolved):
gbn/queue.go (outdated)
gbn/syncer.go (3× outdated, 1 current)
gbn/syncer_test.go (4× outdated)
gbn/gbn_conn.go
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 4998324 to 1e9d938 Compare January 5, 2024 13:41
@ViktorTigerstrom
Contributor Author

Thanks a lot for the review @ellemouton!
Left some comments, mainly on #87 (comment) & #87 (comment). Once those have been resolved, I think this should be good to go!

@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from 1e9d938 to f41a037 Compare January 8, 2024 10:30
The processNACK function returns two booleans. The first boolean
previously indicated only whether the NACK was part of the queue; this
commit expands it to indicate whether the queue needs to be resent as a
result of the NACK.

The second boolean indicates whether the queue's sequence base was
bumped. Previously, the bumped value returned by processNACK didn't
accurately reflect whether the sequence base was actually bumped in the
cases where we didn't trigger a resend. This commit addresses that
issue. Correcting the bumped value has no real impact on the current
implementation of the gbn conn, but it might start having an effect if
the implementation is modified in the future.
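The two-boolean contract can be sketched as below. This is an illustrative stand-in, not the actual lightning-node-connect gbn API: the queue layout, field names, and sequence handling are simplified assumptions.

```go
package main

import "fmt"

// queue is a minimal stand-in for the gbn send queue.
type queue struct {
	sequenceBase uint8 // seq of the oldest un-ACKed packet
	sequenceTop  uint8 // seq one past the newest queued packet
}

// contains reports whether seq falls inside [sequenceBase, sequenceTop),
// accounting for wrap-around of the cyclic sequence space.
func (q *queue) contains(seq uint8) bool {
	if q.sequenceBase <= q.sequenceTop {
		return seq >= q.sequenceBase && seq < q.sequenceTop
	}
	return seq >= q.sequenceBase || seq < q.sequenceTop
}

// processNACK mimics the two-boolean contract described above: the
// first result says whether the queue should be resent, the second
// whether the sequence base was bumped.
func (q *queue) processNACK(seq uint8) (shouldResend, bumped bool) {
	switch {
	case seq == q.sequenceTop:
		// The peer expects the packet right after everything we
		// sent: nothing to resend, but every queued packet is
		// implicitly ACKed, so the base may still be bumped.
		bumped = q.sequenceBase != seq
		q.sequenceBase = seq
		return false, bumped

	case q.contains(seq):
		// NACK for a packet still in the queue: everything
		// before it is implicitly ACKed, bump the base to it
		// and signal a resend.
		bumped = q.sequenceBase != seq
		q.sequenceBase = seq
		return true, bumped

	default:
		// Stale NACK outside the queue: ignore it.
		return false, false
	}
}

func main() {
	q := &queue{sequenceBase: 1, sequenceTop: 5} // packets 1..4 in flight

	resend, bumped := q.processNACK(3)
	fmt.Println(resend, bumped) // prints "true true"
}
```

Note how a NACK equal to sequenceTop can bump the base without triggering a resend — exactly the case where the old code reported a misleading bumped value.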
The syncer struct can be used to ensure that both the sender and the
receiver are in sync before the waitForSync function completes. This
ensures that we don't end up in a loop where one party keeps resending
packets over and over again. This is done by waiting until we receive
either an expected ACK or an expected NACK after resending the queue.
The expected ACK is the ACK for the last packet that has been resent,
and the expected NACK is the NACK for the sequence following the last
packet that has been resent.

To understand why we need to await the expected ACK/NACK after resending
the queue, consider the following scenario:

1. Alice sends packets 1, 2, 3 & 4 to Bob.

2. Bob receives packets 1, 2, 3 & 4, and sends back the respective ACKs.

3. Alice receives the ACKs for packets 1 & 2, but due to latency the
ACKs for packets 3 & 4 are delayed and aren't received until Alice's
resend timeout has passed, which leads to Alice resending packets 3 & 4.
Alice will then receive the delayed ACKs for packets 3 & 4, but will
consider them to be the ACKs for the resent packets, not for the
original packets they were actually sent for. If we didn't wait after
resending the queue, Alice would then proceed to send more packets
(5 & 6).

4. When Bob receives the resent packets 3 & 4, Bob will respond with
NACK 5. Due to latency, the packets 5 & 6 that Alice sent in step (3)
above will then be received by Bob, and be processed as the correct
response to the NACK 5. Bob will then await packet 7.

5. Alice will receive the NACK 5, and now resend packets 5 & 6. But as
Bob is now awaiting packet 7, this send will lead to a NACK 7. Due to
latency, if Alice doesn't wait after resending the queue, she will
proceed to send new packet(s) before receiving the NACK 7.

6. This resend loop would continue indefinitely, so we need to ensure
that Alice waits after she has resent the queue, to ensure that she
doesn't proceed to send new packets before she is sure that both
parties are in sync.

To ensure that we are in sync after we have resent the queue, we wait
until we either:
1. Receive a NACK for the sequence number succeeding the last packet in
the resent queue, i.e. in step (3) above, that would be NACK 5.
OR
2. Receive an ACK for the last packet in the resent queue, i.e. in step
(3) above, that would be ACK 4. After we receive the expected ACK, we
then wait for the duration of the resend timeout before continuing.
The reason we wait for the resend timeout before continuing is that the
ACKs we get after a resend could be delayed ACKs for the original
packets we sent, and not ACKs for the resent packets. In step (3)
above, the ACKs for packets 3 & 4 that Alice received were delayed ACKs
for the original packets. Had Alice immediately continued to send new
packets (5 & 6) after receiving the ACK 4, she would then have received
the NACK 5 from Bob, which was the actual response to the resent queue.
But as Alice had already sent packets 5 & 6 when receiving the NACK 5,
resending the queue in response to that NACK would cause the resend
loop to continue indefinitely.

When either of the 2 conditions above is met, we consider both parties
to be in sync, and we can proceed to send new packets.
When the client receives and processes a FIN packet, we force the
connection to status ClientStatusSessionNotFound, as on rare occasions
the corresponding server may have had time to close the connection
before we've processed the FIN packet sent by the server. In that case,
if we didn't force a new status, the client would never mark the
connection as ClientStatusSessionNotFound.

This solves a bug in the testServerReconnect itest, as that test would
sometimes fail due to the FIN packet being processed without the
connection status changing, leading to a test predicate never being
satisfied.
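The onFIN hook follows the functional-option pattern the gbn package already uses for options like WithKeepalivePing. A minimal sketch of that pattern, with illustrative names (config, conn, handleFIN, and the status strings are assumptions, not the real package internals):

```go
package main

import "fmt"

// config holds the optional hooks a connection can be created with.
type config struct {
	onFIN func()
}

// Option mutates the config; this mirrors the functional-option style.
type Option func(*config)

// WithOnFIN registers a callback invoked once a FIN packet has been
// received and processed.
func WithOnFIN(cb func()) Option {
	return func(c *config) { c.onFIN = cb }
}

type conn struct {
	cfg config
}

func newConn(opts ...Option) *conn {
	c := &conn{}
	for _, o := range opts {
		o(&c.cfg)
	}
	return c
}

// handleFIN is where the client would force the connection status, so
// a racing server-side close can't leave the status unchanged.
func (c *conn) handleFIN() {
	if c.cfg.onFIN != nil {
		c.cfg.onFIN()
	}
}

func main() {
	status := "ClientStatusConnected"
	c := newConn(WithOnFIN(func() {
		status = "ClientStatusSessionNotFound"
	}))
	c.handleFIN()
	fmt.Println(status) // prints "ClientStatusSessionNotFound"
}
```

The caller, not the transport, decides what a FIN means for connection state — which is why the client can force ClientStatusSessionNotFound here while the server side can stay untouched until it demonstrably needs the same treatment.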
@ViktorTigerstrom ViktorTigerstrom force-pushed the 2023-10-client-side-packeting-bug-fix branch from f41a037 to 5e16121 Compare January 8, 2024 10:33
@ellemouton ellemouton merged commit 445d935 into lightninglabs:master Jan 8, 2024
5 checks passed
Successfully merging this pull request may close these issues.

itest: CI often fails with "predicate not satisfied after time out" error
4 participants